Text mining and analytics were conducted on a dataset of articles scraped from Reliefweb.int (RW) about Ukraine. The articles were limited to the year 2022 and to the English language. A total of 3,895 documents were scraped and tokenised, and all stop words (e.g. "the", "a", "we", "can") were removed.
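The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the scraping pipeline itself: the two sample strings stand in for the scraped articles, and the tiny stop-word set stands in for a full English list (e.g. NLTK's).

```python
import re

# Placeholder texts standing in for the 3,895 scraped RW articles.
articles = [
    "We can see a humanitarian response to the war.",
    "The agencies can deliver food and water.",
]

# Illustrative stop-word list only; the real analysis would use a
# standard English stop-word list.
STOP_WORDS = {"the", "a", "we", "can", "to", "and"}

def tokenise(text):
    """Lowercase, split on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokenised = [tokenise(doc) for doc in articles]
```

Each document is thus reduced to a list of content-bearing tokens, ready for counting and pairing.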
As the scatterplot above shows, the type of actor – UN, NGO, donor – is indicated by the colour of each circle. The x-axis shows the number of documents each agency/contributor produced on ReliefWeb in 2022 (on a log scale), and the y-axis shows the number of sectors/themes each agency contributed to. The size of each circle also reflects the number of documents produced.
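The per-agency quantities behind such a plot can be aggregated along these lines. The field names (`agency`, `type`, `sector`) are assumptions for illustration; the real values would come from the scraped RW metadata.

```python
from collections import defaultdict

# Hypothetical per-document metadata records standing in for the
# scraped ReliefWeb fields.
records = [
    {"agency": "UNHCR", "type": "UN", "sector": "Protection"},
    {"agency": "UNHCR", "type": "UN", "sector": "Shelter"},
    {"agency": "WFP", "type": "UN", "sector": "Food"},
]

doc_counts = defaultdict(int)      # x-axis (log scale) and circle size
sector_sets = defaultdict(set)     # y-axis: distinct sectors per agency
for r in records:
    doc_counts[r["agency"]] += 1
    sector_sets[r["agency"]].add(r["sector"])
```

Plotting is then a matter of mapping `doc_counts` to x and size, and the size of each agency's sector set to y.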
Let us take a look at the most common word pairs in the text of the scraped articles. Only the more common word pairs are included, and the thickness of the line between two words indicates how often that pair appears in the corpus. The results are unsurprising.
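Counting these pairs can be sketched as follows, assuming the documents have already been tokenised; the sample lists below are placeholders for the real corpus, and the cut-off of 2 is arbitrary.

```python
from collections import Counter

# Placeholder tokenised documents standing in for the scraped corpus.
docs = [
    ["humanitarian", "response", "humanitarian", "response"],
    ["food", "security", "humanitarian", "response"],
]

# Count adjacent word pairs (bigrams) across all documents.
bigrams = Counter()
for tokens in docs:
    bigrams.update(zip(tokens, tokens[1:]))

# Keep only the more common pairs; in the network graph, line
# thickness maps to these counts.
common = {pair: n for pair, n in bigrams.items() if n >= 2}
```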
If I knew nothing about the situation in Ukraine, I could glean from this graph that there is a war and a humanitarian response to it. I can see sectors being mentioned (food, water, health, protection, education – is shelter set apart because the government is handling it?). I can also see that there is displacement and refugees. The word million shows up, as does scale.
Finally, let us take a macro-view of the dataset and plot, in a bit more detail, the correlations between word pairs – bigrams – within the corpus, so that we may get a lay of the land, so to speak.
This network graph is not only much more complex; it is also built from word pairs – bigrams – which tends to improve interpretability at the cost of sensitivity. The main patterns in the RW corpus are now visible.
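One common way to score such pairwise correlations is the phi coefficient over word co-occurrence across documents (this is the measure used by, for example, R's widyr package; the source does not say which measure was used here, so treat this as an assumption). A minimal sketch, with placeholder documents:

```python
import math

# Placeholder documents, each reduced to its set of distinct words.
docs = [
    {"humanitarian", "response", "war"},
    {"humanitarian", "response"},
    {"food", "water"},
]

def phi(word_a, word_b, docs):
    """Phi coefficient for co-occurrence of two words across documents."""
    n = len(docs)
    n11 = sum(1 for d in docs if word_a in d and word_b in d)
    na = sum(1 for d in docs if word_a in d)
    nb = sum(1 for d in docs if word_b in d)
    n00 = n - na - nb + n11          # documents containing neither word
    n10, n01 = na - n11, nb - n11    # documents containing only one word
    denom = math.sqrt(na * (n - na) * nb * (n - nb))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```

Words that always appear together score 1, words that never co-occur score at or below 0; drawing an edge only for strongly correlated pairs yields a network graph like the one described above.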